Abstract. Clustering is the problem of separating a set of objects into groups (called clusters) so that
objects within the same cluster are more similar to each other than to those in different clusters. Spectral
clustering is a now well-known method for clustering which utilizes the spectrum of the data similarity
matrix to perform this separation. Since the method relies on solving an eigenvector problem, it is
computationally expensive for large datasets. To overcome this constraint, approximation methods have been
developed which aim to reduce running time while maintaining accurate classification. In this article, we
summarize and experimentally evaluate several approximation methods for spectral clustering. From an
applications standpoint, we employ spectral clustering to solve the so-called attrition problem, where one
aims to separate, within a set of employees, those who are likely to voluntarily leave the company from
those who are not. Our study sheds light on the empirical performance of existing approximate spectral
clustering methods and shows the applicability of these methods to an important problem in business
optimization.



1. INTRODUCTION 

Clustering or cluster analysis addresses the problem of separating a set of objects into clusters so that
objects within each cluster are more similar to each other than to objects in different clusters. The
clustering problem has become ubiquitous in data mining and machine learning with applications ranging
from image processing to bioinformatics. What one means by clustering, and the type of clustering
desired, is application dependent. For example, one may wish to segment an image such as that in
Figure 1(a)-(b). In medical imaging, segmentation may aid in the identification of tumors, assist physicians
in surgery and separate anatomical structures. Computer vision applications utilize clustering methods to
identify foreign objects in surveillance images or detect road signs for computer-piloted vehicles. In
statistical analysis, the objects to be clustered may represent individuals in a population, each viewed as a
vector of personal attributes. For example, we will consider the attrition problem: from a dataset of
employees one wishes to identify which cluster of employees is likely to voluntarily leave the company and
which is not. With this problem as our overarching focus, we will consider here and throughout the
case in which we wish to identify two clusters in the data. One can visualize this type of clustering in low
dimensions, for example as seen in Figure 1(c), where the "correct" cluster identification is obvious.

1.1. Contributions. As we discuss later in this section, there are different clustering algorithms such as
k-means or spectral clustering. The focus of this article is on spectral clustering, a method which utilizes
an eigenvector from the so-called data similarity matrix. Computing eigenvectors of such matrices can
be a very expensive operation. Thus, faster approximation algorithms for spectral clustering
have appeared in the literature. The first contribution of this article is to summarize and experimentally
evaluate such approximation algorithms. Our second contribution is to apply spectral clustering
to a modern business optimization problem which we call the attrition problem: given a set of
employees, we would like to separate those who are likely to voluntarily resign from the company from
those who are not. Such information could be of tremendous value to the company because of the high
cost of replacing the workforce. We present the empirical study of approximation algorithms for spectral
clustering in Section 2 and the case study of the attrition problem in Section 4.

1.2. Clustering via k-means. The goal of clustering methods is to identify clusters automatically from
the data input. The k-means clustering method is an approach that separates objects into k clusters so
that each object is assigned to the cluster whose mean is nearest in the Euclidean sense [8, 19]. That is,
given n vectors $x_1, x_2, \ldots, x_n$ in d-dimensional space, $x_j \in \mathbb{R}^d$, the k-means method aims to
minimize the sum of the squared intra-cluster distances:

$$\sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - \mu_i\|_2^2$$

where, for $i = 1, \ldots, k$, $S_i$ contains the vectors in the $i$th cluster, and $\mu_i \in \mathbb{R}^d$ denotes the mean
(center) of the vectors in that cluster.

Although this problem is in general NP-hard [10], efficient iterative algorithms have been developed
that often converge to a locally optimal solution (see e.g. Chapter 20 of [9]). Although variations of the
method exist, the standard approach due to Lloyd (for k = 2 clusters) consists of repeating the two steps
described in Algorithm 1. We denote by $S^c$ the complement of the set $S$.

Algorithm 1 k-means Clustering Method (for k = 2)

1: procedure ($x_j$'s, $\mu_1$, $\mu_2$, T) ▷ data points $x_j \in \mathbb{R}^d$, initial means $\mu_1$, $\mu_2$, number of iterations T
2: for t = 1, 2, ..., T do
3: Cluster by assigning each object to its closest mean:
   $S_1 = \{x_j : \|x_j - \mu_1\|_2 < \|x_j - \mu_2\|_2\}$, $S_2 = S_1^c$
4: Update the mean vectors:
   $\mu_1 = \frac{1}{|S_1|} \sum_{x_j \in S_1} x_j$, $\mu_2 = \frac{1}{|S_2|} \sum_{x_j \in S_2} x_j$
5: end for
6: end procedure
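The two steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments; the function name `kmeans_two_clusters`, the tie-breaking rule, and the handling of an empty cluster are assumptions made here.

```python
import math

def kmeans_two_clusters(points, mu1, mu2, num_iters):
    """Lloyd's iteration for k = 2, following the two steps of Algorithm 1."""
    for _ in range(num_iters):
        # Step 3: assign each point to its closest mean (ties go to cluster 2).
        s1 = [p for p in points if math.dist(p, mu1) < math.dist(p, mu2)]
        s2 = [p for p in points if not math.dist(p, mu1) < math.dist(p, mu2)]
        # Step 4: update each mean; keep the old mean if a cluster is empty
        # (an assumption of this sketch, not part of Algorithm 1).
        if s1:
            mu1 = tuple(sum(c) / len(s1) for c in zip(*s1))
        if s2:
            mu2 = tuple(sum(c) / len(s2) for c in zip(*s2))
    # Final labels with respect to the last updated means.
    labels = [0 if math.dist(p, mu1) < math.dist(p, mu2) else 1 for p in points]
    return labels, mu1, mu2

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, mu1, mu2 = kmeans_two_clusters(points, points[0], points[3], num_iters=10)
```

On this toy input the first three points end up in one cluster and the last three in the other, and each mean settles at the centroid of its group.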



To separate the points into more than 2 clusters, one extends the k-means method in the natural
way. The runtime of the method is $O(knT)$. When the mean of each cluster converges toward the true
cluster center, the k-means method performs well. This is the case, for example, when the clusters are
each of similar size and have a spherical shape as seen in Figure 2(a). However, when the clusters are not
linearly separable, as in Figure 2(b), k-means may often incorrectly assign points to clusters. Although
k-means performs well in many settings, there are also applications where these limitations are apparent,
and this leads us to search for other methods that will work for more general purposes.



Figure 2 The k-means clustering method: (a) two clusters in two dimensions with cluster means converged to
cluster centers (marked with stars); (b) non-spherical clusters are difficult to identify via k-means clustering.



1.3. Spectral Clustering. An alternative way to approach the clustering problem is to view the data
points as a graph. Each vertex of the graph will represent a data point, and each edge will represent
the similarity between the two corresponding vertices. To that end, for n data points $x_1, x_2, \ldots, x_n$ in
d-dimensional space, denote by X the $n \times d$ data matrix whose rows contain the data vectors $x_j$. We
construct a similarity matrix $W \in \mathbb{R}^{n \times n}$ whose $(i, j)$th entry gives the similarity between the two
corresponding data points:



$$W_{ij} = \exp\left(-\frac{\|x_i - x_j\|_2^2}{\sigma_{ij}}\right) \qquad (1)$$



where $\sigma_{ij}$ is a tuning parameter to be chosen later. The similarity matrix W induces a complete graph
$(V, E, W)$ where V is the set of vertices (objects) to be clustered, E is the set of edges, and W represents
the weights of the edges. The clustering problem can then be viewed as the partitioning of the graph
into sets of vertices such that the edges within the sets have large weights, and the edges across sets have
small weights. Formally, in the 2-clustering setting, we wish to identify sets A and B which minimize the
so-called normalized cut objective,

$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)},$$

where the cut and association functions are defined by

$$\mathrm{cut}(A, B) = \sum_{x_i \in A,\, x_j \in B} W_{ij}, \quad \mathrm{assoc}(A, V) = \sum_{x_i \in A,\, x_j \in V} W_{ij}, \quad \text{and} \quad \mathrm{assoc}(B, V) = \sum_{x_i \in B,\, x_j \in V} W_{ij}.$$

The numerators of Ncut defined in this way guarantee that the weights between the clusters A and B
are small. On the other hand, if we simply minimized the cut function, one might obtain cuts for which
A is a very small set of vertices (perhaps even just one vertex) and B is the remaining vertices, as shown
in Figure 3(a). To avoid these trivial cuts, we divide by the association function, which sums the weights
between a set of vertices and all nodes. If a set of vertices in the partition is too small, its association will
be small, leading to a large Ncut. With this normalization, one hopes to avoid this type of bias and obtain
cuts as in Figure 3(b).
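To make the effect of the normalization concrete, the following sketch evaluates the cut and the normalized cut on a small hand-built weight matrix. The matrix and the function names `cut` and `ncut` are illustrative assumptions: vertices 0-1 and 2-3 form two tight pairs, and vertex 4 is weakly attached to everything.

```python
def cut(W, A):
    """Total weight of the edges between A and its complement B."""
    B = set(range(len(W))) - set(A)
    return sum(W[i][j] for i in A for j in B)

def ncut(W, A):
    """Normalized cut of the partition (A, B), with B the complement of A."""
    n = len(W)
    B = set(range(n)) - set(A)
    assoc_A = sum(W[i][j] for i in A for j in range(n))
    assoc_B = sum(W[i][j] for i in B for j in range(n))
    c = cut(W, A)
    return c / assoc_A + c / assoc_B

# Symmetric toy similarity matrix with unit diagonal.
W = [
    [1.00, 0.90, 0.12, 0.12, 0.10],
    [0.90, 1.00, 0.12, 0.12, 0.10],
    [0.12, 0.12, 1.00, 0.90, 0.10],
    [0.12, 0.12, 0.90, 1.00, 0.10],
    [0.10, 0.10, 0.10, 0.10, 1.00],
]
single_vertex = {4}      # isolates the weakly attached vertex
balanced = {0, 1}        # separates the two tight pairs
```

Here the plain cut favors isolating vertex 4 (cut 0.4 versus 0.68), while the normalized cut favors the balanced split (about 0.27 versus about 0.33), which is the bias correction described above.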

Figure 3 Two examples of graph partitioning (one set is shown shaded and the other unshaded): (a) minimizing
the cut of the graph, (b) minimizing the normalized cut of the graph.

Minimizing the normalized cut is NP-complete in general (see [15] for the proof; originally due to
Papadimitriou). However, recently a relaxation of the optimization problem has been reduced to an
eigenvector problem [15]. Given the $n \times n$ similarity matrix W, one defines the normalized Laplacian¹
matrix $L \in \mathbb{R}^{n \times n}$ by

$$L = D^{-1/2}(D - W)D^{-1/2}, \qquad (2)$$

where $D \in \mathbb{R}^{n \times n}$ is the diagonal matrix of node degrees,

$$D_{ii} = \sum_{j} W_{ij}. \qquad (3)$$

Shi and Malik argued that the eigenvector corresponding to the second smallest eigenvalue of L
corresponds to a linear transformation of the relaxed solution to the Ncut problem [15]. Indeed, the clustering
is then performed by selecting an appropriate threshold, and assigning indices of the eigenvector with
large values to one cluster, and indices with small values to the other. For example, with the eigenvector
plotted in Figure 4, the first 400 data points would be assigned to one cluster and the second 400 to the
other. This gives rise to the following formal definition of the spectral clustering algorithm.




Figure 4 An example of an eigenvector obtained from the spectral clustering method (horizontal axis repre- 
sents the index, vertical the value of the eigenvector at that index). Here we assign the first 400 objects to one 
cluster and the second 400 to the other. 



Algorithm 2 Spectral Clustering Method (for two clusters)

1: procedure (X, σ) ▷ $n \times d$ data matrix X, tuning parameter σ
2: Construct the similarity matrix W in (1), degree matrix D in (3) and Laplacian L in (2)
3: Compute the eigenvector corresponding to the second smallest eigenvalue of L
4: Assign the indices in the eigenvector with large values to one cluster, the rest to the other
5: end procedure
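The steps of Algorithm 2 can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' Matlab code: the similarity is taken to be Gaussian with a single constant scaling parameter `sigma`, and the threshold is set at zero, which is only one possible choice.

```python
import numpy as np

def spectral_two_clusters(X, sigma):
    """Two-way spectral clustering with a constant scaling parameter sigma."""
    # Assumed similarity: W_ij = exp(-||x_i - x_j||^2 / sigma).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / sigma)
    # Degrees and normalized Laplacian L = D^{-1/2} (D - W) D^{-1/2}
    # = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order, so column 1 holds the
    # eigenvector of the second smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(L)
    v2 = eigvecs[:, 1]
    # Threshold at zero (an assumption of this sketch).
    return (v2 > 0).astype(int), v2

# Two groups of four points, separated along the first coordinate.
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X = np.vstack([square, square + [5.0, 0.0]])
labels, v2 = spectral_two_clusters(X, sigma=4.0)
```

The sign of an eigenvector is arbitrary, so which group receives label 0 may vary; only the partition itself is meaningful.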



The step which is most computationally burdensome is the eigenvector computation. In general this
step yields an $O(n^3)$ running time. This cost is often prohibitive for large applications and is one of the
biggest drawbacks to spectral clustering methods.



¹ Note that one may also consider the Laplacian $L = D - W$.
Experimental results using the spectral clustering method are shown in Figure 5. Here we use the
self-tuning approach by Zelnik-Manor and Perona [13] for obtaining the similarity matrix. Consider the
vector $v \in \mathbb{R}^n$ defined entrywise by

$$v_i = \|x_i - x_{i,K}\|_2, \qquad (4)$$

where $x_{i,K}$ denotes the $K$th closest neighbor to $x_i$. We then set the scaling parameter $\sigma_{ij}$ as

$$\sigma_{ij} = v_i v_j. \qquad (5)$$
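The self-tuning quantities in (4) and (5) can be computed directly. The sketch below is illustrative only; the function name `self_tuning_scales` is hypothetical and a brute-force neighbor search is assumed for simplicity.

```python
import math

def self_tuning_scales(points, K):
    """Return v (distance to the K-th closest neighbor) and sigma_ij = v_i * v_j."""
    v = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in ascending order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        v.append(dists[K - 1])
    sigma = [[vi * vj for vj in v] for vi in v]
    return v, sigma

# Four points on a line; K = 2 picks the second closest neighbor of each.
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
v, sigma = self_tuning_scales(points, K=2)
```

For these four points the second-neighbor distances are 3, 2, 3 and 6, so for instance the scaling between the first two points is 3 · 2 = 6.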

As in that article, we set K = 7 for all our experiments with the spectral clustering algorithm. The
experiments performed on the interlocked rings, interlocked half rings, and Gaussian strips were run on Intel
Core 2 Duo E8500 3.16 GHz machines with 6 MB cache and 16 GB memory. The concentric spheres
and concentric rings experiments were run on Intel Core i7 870 2.93 GHz machines with 4 cores, 8 MB
cache and 16 GB memory. The tangent spheres experiments were run on an Intel Xeon W3520 2.67 GHz
machine with 4 cores, 8 MB cache and 16 GB memory.

Figure 5 Accurate clustering results for data sets of many shapes and sizes using spectral clustering: (a) Gaussian
strips, (b) Interlocked half rings, (c) Concentric rings, (d) Concentric spheres, (e) Tangent spheres, and (f)
Interlocked rings.

The running times and dataset sizes are summarized in Table 1.

Dataset                | Size n | Running time (s)
Gaussian strips        | 200    | 0.17
Interlocked half rings | 373    | 0.73
Concentric rings       | 800    | 8.02
Concentric spheres     | 5000   | 855.12
Tangent spheres        | 10000  | 6365.4
Interlocked rings      | 10000  | 6870.7

Although the spectral clustering method performs accurate cluster identification even for datasets
challenging to the k-means method, its cubic runtime is a practical obstacle in many applications. For
this reason, approximation methods have been developed which efficiently approximate the spectrum
of the Laplacian matrix L. In this article, we consider four popular approximation methods. The Fast
spectral clustering method [20] and Extensible spectral clustering method [18] both identify a small set
of representative points from the dataset, perform spectral clustering on this much smaller set of points,
and extend the identification to the remaining data points. An alternative way to reduce the dimension of
the similarity matrix is to randomly sample values from the matrix to obtain a smaller submatrix, which
is the basis of the Spectral clustering on a budget [14] and Nystrom methods [5].

1.4. Organization. The remainder of the article is organized as follows. The four approximation methods
are described and discussed in Section 2. Section 3 displays numerical results used for a comparison
between the methods. In Section 4 we focus on the attrition problem and analyze how each method
performs at that task. We conclude in Section 5 with a discussion of the findings.

2. Approximation Methods 

The fundamental idea behind efficiently approximating the spectral clustering method is to reduce 
the problem size to be clustered. To maintain accurate cluster identification, one hopes that the reduced 
problem preserves the same cluster structure as the original problem. The two main approaches to this 
goal that we discuss here rely either on randomness to reduce the dimension, or some preprocessing 
algorithmic step to ensure that the smaller set is a good representation of the original. To evaluate ac- 
curacy of the method one uses the results of spectral clustering as the ground truth, and compares the 
output of the other methods to that. To evaluate in a general sense, rather than example to example, one 
may wish to compare the eigenvector computed by the approximation method to that of the spectral 
clustering method. We discuss these notions and describe the methods in the remainder of this section. 

The common theme between approximate spectral clustering methods is that the $n \times n$ similarity
matrix W is downsampled so that clustering can be performed efficiently. Such downsampling will of
course lead to errors in the computed eigenvectors, and one wishes to quantify the magnitude of such
perturbations to validate the accuracy of the approximate method. In a general context, we can view this
process as a perturbation of the Laplacian matrix, $\tilde{L} = L + E$, where E is some error matrix and $\tilde{L}$ is the
perturbed Laplacian matrix. Standard results from linear algebra guarantee the following bound on the
perturbation of eigenvectors.

Theorem 2.1 (Eigenvector Perturbations [20, 6, 16]). Suppose $\tilde{L} = L + E$ and denote by $v_i$ and $\tilde{v}_i$ the $i$th
eigenvectors of $L$ and $\tilde{L}$, respectively, corresponding to the $i$th smallest eigenvalue. Then

$$\|\tilde{v}_2 - v_2\|_2 \le \frac{1}{\lambda_2 - \lambda_3}\|E\| + O(\|E\|^2),$$

where $\lambda_i$ denotes the $i$th smallest eigenvalue of W.

This result shows that the perturbation in the eigenvectors is controlled by the (spectral) norm of the
perturbation in the matrix, and the eigengap $\lambda_2 - \lambda_3$. This theory can be extended to bound the angles
between eigenspaces of the original and perturbed matrices as well as the norm of their projections [1].

In analyzing approximation methods, one wishes to determine the tradeoff between accuracy and
efficiency. To quantify theoretically and empirically the performance of an approximation method, we
define the mis-clustering rate by

$$\rho = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_i, \qquad (6)$$

where $\mathbb{1}_i$ is the indicator function which is equal to 1 if object $x_i$ is clustered incorrectly and zero
otherwise. We assume here that the correct clustering is given by the spectral clustering method, and compare
the results of the approximation methods against that standard. One can then bound the mis-clustering
rate by the difference in the eigenvectors under certain assumptions.

Theorem 2.2 (Misclustering rate [6, 20]). Suppose $\tilde{L} = L + E$ and denote by $v_2$ and $\tilde{v}_2$ the 2nd (smallest)
eigenvectors of $L$ and $\tilde{L}$, respectively. Then when both sets of eigenvectors partition the data into two
clusters and the perturbations in the eigenvectors satisfy the componentwise assumptions of [6],

$$\rho \le \|\tilde{v}_2 - v_2\|_2^2.$$

This result motivates the development of approximation methods which yield small perturbations in 
the eigenvectors of the downsampled matrix. 
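In practice the mis-clustering rate can be computed by comparing the labels of an approximation method with the reference labels from spectral clustering. The sketch below is a minimal illustration; the function name is hypothetical, and the handling of swapped cluster names is an assumption added here because the two cluster labels are interchangeable.

```python
def misclustering_rate(reference, predicted):
    """Fraction of objects whose cluster differs from the reference clustering.

    Cluster names are arbitrary, so for two clusters the rate is the smaller of
    the disagreement with `predicted` and with its label-swapped version.
    """
    n = len(reference)
    disagree = sum(1 for r, p in zip(reference, predicted) if r != p)
    return min(disagree, n - disagree) / n

reference = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [1, 1, 1, 0, 0, 0, 0, 0]   # same split up to naming, one object wrong
rho = misclustering_rate(reference, predicted)
```

Here seven of the eight raw labels differ, but after swapping the cluster names only one object disagrees, giving a rate of 1/8.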

2.1. Fast Spectral Clustering. The fast spectral clustering algorithm by Yan et al. [20] consists of two major
parts: data preprocessing and spectral clustering. The goal of the data preprocessing is to construct
a smaller, but representative set of points to undergo spectral clustering rather than the original large
dataset. Since the k-means method itself identifies k representative points (usually the cluster means),
it seems a natural way to identify a representative set of points even if k is larger than the number of
clusters. One can then perform spectral clustering efficiently on the representative points, and assign
clusters to the entire original dataset by simply choosing the cluster containing the closest representative
point. The algorithm for fast spectral clustering is described in Algorithm 3 below.

Algorithm 3 Fast Spectral Clustering Method

1: procedure (X, k, T) ▷ $n \times d$ data matrix X, number of representative points k, number of iterations T
2: Find k representative points (centroids $y_1, \ldots, y_k$) via k-means
3: Create a correspondence table that associates each $x_i$ with the nearest cluster centroid $y_j$
4: Run spectral clustering on the data matrix Y of centroids to obtain a clustering assignment
5: Use the correspondence table to recover cluster membership for each point $x_i$
6: end procedure



The complexity for the k-means step is $O(knT)$, and since spectral clustering is only run on the k
representative points, that step yields a cost of just $O(k^3)$. The remaining assignment steps cost at most $O(n)$,
yielding an overall runtime of $O(knT + k^3)$. This is of course a significant improvement over the cubic
$O(n^3)$ of spectral clustering when k and T are chosen small enough. To quantify precisely this tradeoff
between efficiency and accuracy, the perturbation theory of Theorem 2.2 can be utilized. Indeed, results
on fast spectral clustering guarantee the following bound on the mis-clustering rate.



Theorem 2.3 (Spectral misclustering rate [20]). Assume that the assumptions of Theorem 2.2 hold. Then
the mis-clustering rate $\rho$ (6) for fast spectral clustering satisfies

where $\|\cdot\|_F$ denotes the Frobenius norm, $L$ and $\tilde{L}$ denote the Laplacian and perturbed Laplacian, and the
symbol $\lesssim$ implies that higher order terms are ignored in the relation.

This result demonstrates that the mis-clustering rate is again controlled by the eigengap and the
perturbations in the Laplacian incurred via fast spectral clustering. The latter term can be bounded in
special cases, see [20] for details.
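As a rough worked example of the runtime comparison for fast spectral clustering, take the illustrative values n = 10000, k = 500 (a 5% sample) and T = 10. These numbers are chosen here for arithmetic only and are not taken from the experiments, and constants hidden by the O-notation are ignored:

$$knT + k^3 = 500 \cdot 10^4 \cdot 10 + 500^3 = 5 \times 10^7 + 1.25 \times 10^8 = 1.75 \times 10^8, \qquad n^3 = 10^{12},$$

so the operation count drops by a factor of roughly $10^{12} / (1.75 \times 10^8) \approx 5.7 \times 10^3$ under these assumptions.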

2.2. Extensible Spectral Clustering. The notion of identifying a small representative sample of the data
on which to initially perform spectral clustering can also be generalized. This class of methods is given
the name extensible spectral clustering (eSPEC) [12, 2, 18]. Here, one performs spectral clustering on the
representative sample of the data, and assigns each object in the original dataset to a cluster based on its
m closest neighbors within the representative sample. We again use the similarity matrix (1) to measure
"closeness." The general model is described in Algorithm 4.

Algorithm 4 Extensible Spectral Clustering Method

1: procedure (X, m, S) ▷ $n \times d$ data matrix X, neighboring parameter m, representative sample S
2: Run spectral clustering on the representative sample S to obtain a clustering assignment
3: For each object i in $S^c$, find its m closest neighbors in S
4: Assign each object i to the cluster containing the majority of its m closest neighbors
5: end procedure
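Steps 3 and 4 of Algorithm 4, the extension from the sample to the remaining objects, can be sketched as below. The function name is hypothetical, and Euclidean distance is used as the notion of closeness; this matches ranking by a Gaussian similarity only when the scaling parameter is a single constant, which is an assumption of this sketch.

```python
import math
from collections import Counter

def extend_clustering(sample_points, sample_labels, other_points, m):
    """Assign each point outside the sample by majority vote of its m closest sample points."""
    labels = []
    for p in other_points:
        # Indices of sample points sorted by Euclidean distance to p.
        order = sorted(range(len(sample_points)),
                       key=lambda i: math.dist(p, sample_points[i]))
        votes = Counter(sample_labels[i] for i in order[:m])
        labels.append(votes.most_common(1)[0][0])
    return labels

sample_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
sample_labels = [0, 0, 0, 1, 1, 1]          # e.g. from spectral clustering on S
other_points = [(0.5, 0.5), (5.5, 5.5), (1, 1)]
extended = extend_clustering(sample_points, sample_labels, other_points, m=3)
```

Choosing m odd avoids ties in the two-cluster setting; with m = 1 the rule reduces to copying the label of the single closest sample point.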



There are of course many ways one can initially obtain the representative sample. As in the fast spectral
clustering method, one can utilize the k-means method to identify a good representative sample S.
Indeed, if the centroids found by the k-means method coincide with data points in the set, the extensible
spectral clustering method with m = 1 is the same as the fast spectral clustering method. Alternatively,
the representative sample can be chosen randomly. For example, one can simply sample uniformly at
random from the dataset (see e.g. [12, 2, 18] and references therein) or according to some other probability
distribution such as one that assigns probabilities proportional to the norms of each column [4].
In the experiments section below, we see that using uniform sampling with just m = 1 provides accurate
results even for reasonably small sample sizes. In this case the running time of the method is dominated
by the size of the sample, $O(|S|^3)$.

2.3. Nystrom Method. Both the fast spectral clustering method and extensible spectral clustering
reduce the dimension of the clustering problem by subsampling the objects in the data. An alternative
approach is to subsample the similarity matrix W in (1) itself. In this case, one uses a submatrix of W
and asks that the submatrix approximates the entire matrix W well. This is the motivation behind the
Nystrom method fTTlfTllBi. 

To that end, we decompose the $n \times n$ similarity matrix W so that

$$W = \begin{pmatrix} W_{11} & W_{21}^T \\ W_{21} & W_{22} \end{pmatrix}, \qquad (7)$$

where $W_{11} \in \mathbb{R}^{m \times m}$, $W_{21} \in \mathbb{R}^{(n-m) \times m}$, and $W_{22} \in \mathbb{R}^{(n-m) \times (n-m)}$. Choosing $m \ll n$, $W_{22}$ is very large, and
this is thus the part we wish to approximate.

To do so, one computes the similarity matrix for only the m sampled data points, represented by $W_{11}$.
The relationship between the sampled data points and the rest of the points is given by $W_{21}$. Then only
$W_{11}$ and the first m columns of W, denoted $W_m = (W_{11} \; W_{21}^T)^T$, are needed to compute the Nystrom
approximation:

$$\tilde{W} = W_m W_{11}^{-1} W_m^T.$$

The eigenvectors of $\tilde{W}$ are then used as an approximation to the eigenvectors of W. Unfortunately,
these approximate eigenvectors are not necessarily orthogonal, a property that is necessary for the
spectral clustering problem. However, when W is positive semidefinite, these eigenvectors can be
orthogonalized efficiently. First, we construct

$$Q = W_{11} + W_{11}^{-1/2} W_{21}^T W_{21} W_{11}^{-1/2}.$$

Then we compute the eigendecomposition of Q to obtain a matrix U whose columns are equal to the
eigenvectors of Q and a diagonal matrix $\Lambda$ with diagonal entries equal to its eigenvalues. The
orthogonalized approximate eigenvectors of W are then computed as the columns of

$$V = W_m W_{11}^{-1/2} U \Lambda^{-1/2},$$

which can be used for clustering. The Nystrom method is thus described in Algorithm 5.

Algorithm 5 Nystrom Method

1: procedure (W, m) ▷ $n \times n$ similarity matrix W, sample size m
2: Decompose the similarity matrix W as in (7)
3: Compute the approximation $\tilde{W} = W_m W_{11}^{-1} W_m^T$
4: Set $Q = W_{11} + W_{11}^{-1/2} W_{21}^T W_{21} W_{11}^{-1/2}$
5: Compute the eigendecomposition of Q to obtain eigenvectors U and eigenvalues $\Lambda$
6: Compute the orthogonalization $V = W_m W_{11}^{-1/2} U \Lambda^{-1/2}$
7: Use the columns of V as approximate eigenvectors for spectral clustering
8: end procedure

For this process to work, however, one requires that the similarity matrix W be positive semidefinite.
Therefore, the self-tuning approach given in (5) can no longer be utilized since it will not necessarily
guarantee positive semidefiniteness. However, using the fact that choosing σ equal to a constant yields
a similarity matrix which is positive semidefinite [3], the matrix for this algorithm can be self-tuned by
setting



$$W_{ij} = \exp\left(-\frac{\|x_i / v_i - x_j / v_j\|_2^2}{c}\right),$$

where $v_i$ is defined in (4), and c is some fixed constant.

However, this self-tuning approach yielded worse results empirically than simply manually setting the
scaling parameter σ. Figure 6 demonstrates the percentage of misclustered data points via the Nystrom
method with σ = 1 and the self-tuning approach, for the tangent spheres data (shown in Figure 5(e)).




Figure 6 Manually setting σ gives better results than the self-tuning method. An example is shown here for
the tangent spheres dataset for (a) σ = 1 and (b) self-tuning. However, different datasets often favor different
values of σ.



The Nystrom method is more efficient than the exact algorithm because it is not necessary to compute
eigenvectors of the entire dense similarity matrix. It has a time complexity of $O(nm^2 + m^3)$, which for
$m \ll n$ is significantly less than the $O(n^3)$ runtime of the standard spectral clustering method.
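The orthogonalization step of the Nystrom method can be sketched with NumPy as follows. This is an illustration under assumptions rather than the authors' implementation: the function and helper names are hypothetical, the sampled block is assumed symmetric positive definite so that its inverse square root exists, and a Gaussian similarity with constant scaling is used for the toy data.

```python
import numpy as np

def nystrom_eigenvectors(W11, W21):
    """Orthogonalized Nystrom approximation of the eigenvectors of W.

    W11 is the m x m similarity matrix of the sampled points and W21 the
    (n - m) x m block relating the remaining points to the sample.
    W11 is assumed to be symmetric positive definite.
    """
    # W11^{-1/2} from the eigendecomposition of the symmetric matrix W11.
    evals, evecs = np.linalg.eigh(W11)
    W11_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # Q = W11 + W11^{-1/2} W21^T W21 W11^{-1/2}
    Q = W11 + W11_inv_sqrt @ W21.T @ W21 @ W11_inv_sqrt
    lam, U = np.linalg.eigh(Q)
    # V = W_m W11^{-1/2} U Lambda^{-1/2}, with W_m the first m columns of W.
    Wm = np.vstack([W11, W21])
    V = Wm @ W11_inv_sqrt @ U @ np.diag(lam ** -0.5)
    return V, lam

def gaussian(A, B, sigma=1.0):
    """Gaussian similarities between the rows of A and the rows of B."""
    return np.exp(-np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2) / sigma)

sample = np.array([[0, 0], [2, 0], [0, 2], [2, 2], [1, 1]], dtype=float)
rest = np.array([[0.5, 0.5], [1.5, 0.5], [0.5, 1.5], [1.5, 1.5], [1.0, 0.0], [0.0, 1.0]])
W11 = gaussian(sample, sample)
W21 = gaussian(rest, sample)
V, lam = nystrom_eigenvectors(W11, W21)
```

The columns of V are orthonormal, and V diag(lam) V^T reproduces the Nystrom approximation built from the first m columns, while only m x m eigenproblems are solved.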

2.4. Spectral Clustering on a Budget. The Nystrom method relies on a submatrix to approximate the
entire similarity matrix W. There, one usually samples blocks or rows/columns at a time. Alternatively,
one can simply sample the entries themselves at random. This is the approach of the spectral clustering
on a budget method [14]. The aim is to randomly select b different entries in the matrix (for some budget
constraint b) and store only those. The remaining entries are set to zero, forcing the approximation to
be a sparse matrix whose eigenvectors can be computed efficiently.

More specifically, the indices for the entries are chosen uniformly at random and without replacement
from $\{(i, j) : i < j\}$. A new matrix $\tilde{W}$ is formed whose entries are given by

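$$\tilde{W}_{ij} = \tilde{W}_{ji} = \begin{cases} 1 & \text{if } i = j \\ \frac{n(n-1)}{2b} W_{ij} & \text{if } (i, j) \text{ is queried} \\ 0 & \text{otherwise.} \end{cases} \qquad (8)$$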



We thus formulate the spectral clustering on a budget method formally as in Algorithm 6.



Algorithm 6 Spectral clustering on a budget

1: procedure (W, b) ▷ $n \times n$ similarity matrix W, budget b
2: Create $\tilde{W}$ by selecting b entries of W according to (8)
3: Run spectral clustering efficiently using the sparsified approximation $\tilde{W}$
4: end procedure
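The sparsification in step 2 of Algorithm 6 can be sketched in plain Python. The function name is hypothetical, and two details are assumptions of this sketch rather than facts taken from the text: queried entries are rescaled by n(n-1)/(2b), the inverse of the probability that a given pair is sampled, and the diagonal of W is copied unchanged.

```python
import random

def sparsify_on_budget(W, b, seed=0):
    """Keep b off-diagonal entries of W chosen uniformly at random, zero the rest."""
    n = len(W)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Sample b index pairs with i < j, uniformly and without replacement.
    queried = random.Random(seed).sample(pairs, b)
    scale = n * (n - 1) / (2 * b)  # assumed inverse-probability rescaling
    W_tilde = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W_tilde[i][i] = W[i][i]   # diagonal kept as is (assumption)
    for i, j in queried:
        W_tilde[i][j] = W_tilde[j][i] = scale * W[i][j]
    return W_tilde

W = [[1.0] * 6 for _ in range(6)]
W_tilde = sparsify_on_budget(W, b=5)
```

With a budget of b pairs the result has exactly 2b nonzero off-diagonal entries and stays symmetric, so a sparse eigensolver can be applied to its Laplacian.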



It is shown that the perturbation in the eigenvectors via the downsampling in this method can be
bounded with high probability.

Theorem 2.4 (Spectral clustering on a budget [14]). Suppose that the budget is $b < \frac{1}{2}(n^2 - n)$, and denote
by $v_2$ and $\tilde{v}_2$ the 2nd (largest) eigenvectors of the Laplacian $L = D - W$, and corresponding perturbed
Laplacian $\tilde{L}$, respectively. Then

$$\min\left(\|\tilde{v}_2 - v_2\|_2, \|-\tilde{v}_2 - v_2\|_2\right) \lesssim \frac{1}{\lambda_2 - \lambda_3}\left(\frac{n^{5/3}}{b^{2/3}} + \frac{n^{3/2}}{b^{1/2}}\right),$$

where $\lambda_i$ denotes the $i$th eigenvalue of $L$ and the relation $\lesssim$ ignores lower order logarithmic factors.

This result shows that either $\tilde{v}_2$ or $-\tilde{v}_2$ is close to the true eigenvector and can be used for spectral
clustering (note that the negative of the eigenvector preserves the same clustering). This closeness is
again controlled by the eigengap of the Laplacian and the budget size b. If the data are well-clustered, W
can be sparsified to have $O(n \log^{3/2} n)$ nonzero entries, and then spectral clustering can be performed in
$O(n \log n)$ time.



3. Numerical Results for Approximation Methods 

We next describe experimental results for the approximation methods of Section 2. We run each
method using the datasets shown in Figure 5. Each cluster in these sets is clearly defined, and accuracy
can thus be easily analyzed. The aim of these experiments is to compare and analyze the relationship
between sample size, runtime, and accuracy for the approximation methods. We use the convention that
a z% sample size refers to the percentage z of data used in the sample. For the fast spectral clustering
method, this size corresponds to the number of centroids utilized, k/n. The error is reported in terms
of the misclustering rate (6). Although experiments across different datasets were run on machines of
varying specifications, each algorithm for a fixed dataset was run on the same machine to allow for fair
comparison. The algorithms were implemented in Matlab as described in the pseudocode of Section 2
and the runtimes were computed via the cputime function. The experiments performed on the
interlocked rings, interlocked half rings, and Gaussian strips were run on Intel Core 2 Duo E8500 3.16 GHz
machines with 6 MB cache and 16 GB memory. The concentric spheres and concentric rings
experiments were run on Intel Core i7 870 2.93 GHz machines with 4 cores, 8 MB cache and 16 GB memory.
The tangent spheres experiments were run on an Intel Xeon W3520 2.67 GHz machine with 4 cores, 8
MB cache and 16 GB memory. A constant σ value was used for the Nystrom method and the spectral
clustering on a budget method for the interlocked rings dataset (for the values, see the tables below);
otherwise, the self-tuning approaches were used.

Table 2 and Figure 7 display the results for each algorithm on the Gaussian strips dataset, depicted in
Figure 5(a). For this dataset, the error rate and time for the fast spectral clustering method tend to
decrease when the number of representative points k is small enough. This is most likely because at some
point, the k-means clustering portion of the algorithm controls how the data clusters. To get a small error
rate for a small enough k, the k-means clustering must work well with the dataset, in which case spectral
clustering is perhaps not necessary (as is most likely the case for a well-separated dataset like the Gaussian
strip set). However, for datasets for which k-means does not work well, such as the eye dataset below, we
cannot assume that a very small k will have the same accurate results.

As seen in Figure 7, eSPEC generally performs better than the Nystrom method, but fails when the
sample size is too small. Spectral clustering on a budget performs worst overall, yielding the
highest error rate if the budget is too small. A sufficiently large budget will allow the algorithm
to run faster than the original spectral clustering algorithm, but a larger or slightly smaller budget does 
not significantly change the runtime. The Nystrom method and eSPEC reach a point beyond which an 
increase in running time fails to produce a commensurate decrease in error rate. For small sample sizes, 
time does not change as much, but error rate can increase substantially. Since the Gaussian data consists 
of only n = 200 points, taking a sample as low as 5% might be too small for Nystrom and eSPEC to work 
well. This is not a problem for fast spectral clustering since the data can be clustered with /c-means 
clustering. 



(n = 200)   | Fast            | Budget          | Nystrom (σ = 1) | eSPEC
Sample Size | Time   | Error  | Time   | Error  | Time   | Error  | Time   | Error
2%          | 0.0122 | 0.0036 | 0.2044 | 0.4914 | 0.0097 | 0.208  |        |
5%          | 0.0156 | 0.068  | 0.1017 | 0.1308 | 0.0112 | 0.1344 | 0.0287 | 0.0379
10%         | 0.0181 | 0.0548 | 0.078  | 0.0015 | 0.014  | 0.1183 | 0.0271 | 0.0656
15%         | 0.0212 | 0.0046 | 0.0889 |        | 0.0218 | 0.0719 | 0.0275 | 0.0364
20%         | 0.0225 |        | 0.0858 |        | 0.0281 | 0.0321 | 0.0293 | 0.005
25%         | 0.0271 |        | 0.0952 |        | 0.0312 | 0.0249 | 0.0318 | 0.0098
30%         | 0.0300 |        | 0.088  |        | 0.0415 | 0.0042 | 0.0337 |
35%         | 0.0343 |        | 0.0924 |        | 0.0546 | 0.0088 | 0.0396 |
40%         | 0.0371 |        | 0.0877 |        | 0.0621 |        | 0.0446 |
50%         | 0.0771 |        | 0.0952 |        | 0.1026 |        | 0.0805 |

Table 2 The run time and error rate for each sample size and each approximation algorithm, run on the Gaussian
strip dataset.



Table 3 and Figure 8 present the experimental results for the interlocked half rings dataset, depicted in 
Figure 5(b). For this dataset, the Nystrom method is ideal across the board: it has the smallest error rate 
and requires the least amount of time. Fast spectral clustering and eSPEC generally follow the same trend 
and are very similar in performance. For small datasets in general, even if k-means clustering does not 
work well with the dataset, fast spectral clustering is effective. Cells without numerical entries indicate 
regimes for which the parameters were too small for the algorithm to perform. 

The results obtained from the concentric rings dataset (depicted in Figure 5(c)) are displayed in 
Table 4 and Figure 9. The table shows that spectral clustering on a budget is very inaccurate with this 
dataset when the sample size is 3% or below, where it misclusters at least 1 in 5 points. We see that all 
methods identify the clusters exactly once 10% of the data is used in the sample. 

Table 5 and Figure 10 contain the results of the experiments on the concentric spheres dataset, 
depicted in Figure 5(d). As this is the three-dimensional analog of the concentric rings, it is not surprising 
that we see similar results. Notably, all algorithms cluster perfectly when the sample size is 5% or larger. 

Table 6 and Figure 11 present the results of the algorithms run on the tangent spheres dataset, depicted 
in Figure 5(e). This dataset gives us a prime example of how approximate spectral clustering algorithms 
are ideal for large, structured data. The exact algorithm takes 6,365.4 seconds, or roughly 1.8 hours, to give 

Figure 7 Graphs show the (a) error rate and (b) CPU running time (in seconds) when using different sample 
sizes for all approximation algorithms. Gaussian strip dataset has n = 200 data points and sample sizes range 
from 0-100%. 

(n = 373)    Fast                Budget              Nystrom             eSPEC
Sample Size  Time      Error     Time      Error     Time      Error     Time      Error
2%           0.0215    0.0881                        0.005     0.1       0.0452    0.1872
5%           0.0231    0.1461                        0.0044    0.002     0.0462    0.2122
15%          0.0356    0.2116                        0.0069              0.0465    0.2426
25%          0.1017    0.1639    0.2805    0.1396    0.01                0.0908    0.1795
35%          0.1335    0.1312    0.2493    0.0595    0.0128              0.1114    0.1892
45%          0.1529    0.0959    0.2658    0.0129    0.0209              0.1407    0.1498
55%          0.2399    0.0354    0.2855    0.0078    0.0228              0.2287    0.1031
65%          0.2874    0.0229    0.3170    0.0077    0.0456              0.2852    0.069
75%          0.3526    0.0139    0.3382              0.0708              0.3594    0.0229
85%          0.4287              0.3547              0.102               0.4711    0.0056

Table 3 The run time and error rate of each sample size for each approximation algorithm, run on the 
interlocked half rings dataset. 



(n = 800)    Fast                Budget              Nystrom (σ = 0.5)   eSPEC
Sample Size  Time      Error     Time      Error     Time      Error     Time      Error
1.0%         0.0187    0.2699    0.3289    0.4979    0.0362    0.2404    0.0646    0.3076
2.0%         0.0259    0.0792    0.3017    0.4341    0.0356    0.1311    0.0661    0.1176
2.5%         0.0291    0.012     0.2839    0.2841    0.0356    0.1127    0.0646    0.0441
3%           0.0324    0.0062    0.2689    0.2061    0.0427    0.0682    0.0633    0.0102
4%           0.0674              0.2424    0.0564    0.0378    0.0258    0.0643    0.0046
5%           0.0627              0.2409    0.0131    0.0415              0.0655    0.0001
10%          0.1186              0.2792              0.0615              0.0764
15%          0.171               0.341               0.0967              0.127

Table 4 The (a) run time and (b) error rate of each sample size for each approximation algorithm, run on the 
concentric rings dataset. 



results, when nearly identical results can be given in seconds using approximations. When just 4.25% of 
the data is sampled, all of the algorithms perform with error less than 0.01, and all under a minute. 

Figure 8 The (a) error rate and (b) CPU running time in seconds when using different sample sizes for all 
approximation algorithms. Interlocked half rings dataset has n = 373 data points and sample sizes range from 
0-100%. 



[Figure 9, panels (a) and (b). Horizontal axis: Percentage of Data Sampled (0% to 15%). Legend: Fast, Budget, Nystrom, eSPEC.] 


Figure 9 The (a) average error rate and (b) running time when using different sample sizes for all approxima- 
tion algorithms. Concentric rings dataset has n = 800 data points and sample sizes range from 0-15%. 



Finally, Table 7 and Figure 12 present the results of the algorithms run on the interlocked rings 
dataset of Figure 5(f). We again see results similar to those for the tangent spheres dataset, most likely because 
the structure of the clusters is similar. 



4. Case Study: The Attrition Problem 

Clustering methods have a wide range of applications, ranging from image segmentation to social 
network identification. An application we focus on here is the attrition problem. In this setting, the 
data objects correspond to employees of a particular company, and each data vector contains a list of 
employee attributes. For example, age, salary, years at the company, number of children, age of children, 
etc. may all be relevant attributes. From this data, one wishes to distinguish a cluster of employees 
who are likely to leave the company from the other cluster of employees who are likely to stay with the 
company. Such a classification allows companies to invest resources appropriately in an effort to retain 
desired employees, saving significant expense and training time. We analyze here how this problem can 
be solved using spectral clustering methods. Because the number n of employees may be very large, and 
the number of attributes collected about the employees may also be very large, approximation methods 
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1.4546 





0.3716 
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0.0498 
(n = 5,000)  Fast                Budget              Nystrom             eSPEC
Sample Size  Time      Error     Time      Error     Time      Error     Time      Error
5.0%         4.3699              17.2624             3.1868              1.1784
10.0%        9.6112              25.099              7.3857              3.6448
15.0%        17.528              67.9837             15.3998             10.3045

Table 5 The run time and error rate of each sample size for each approximation algorithm, run on the 
concentric spheres dataset. 




Figure 10 The (a) average error rate and (b) running time when using different sample sizes for all approxima- 
tion algorithms. Concentric spheres dataset has n = 5,000 data points and sample sizes range from 0-15%. 
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0.0001 


2.5556 


0.0083 


(n = 10,000) Fast                Budget              Nystrom             eSPEC
Sample Size  Time      Error     Time      Error     Time      Error     Time      Error
10%          64.279    0.0016    100.6472  0.000656  27.7863   0.0001    15.0004   0.0054
15%          118.1536  0.0013    136.7211  0.000466  60.9053   0.0001    40.0202   0.0042

Table 6 The run time and error rate of each sample size for each approximation algorithm, run on the tangent 
spheres dataset. 



are crucial to solve this problem efficiently. In contrast to the examples of Section 3, datasets in this 
setting are not only high dimensional, but accuracy is often difficult to quantify since there may no longer 

Figure 11 The (a) average error rate and (b) CPU running time in seconds when using different sample sizes 
for all approximation algorithms. Tangent spheres dataset has n = 10,000 data points and sample sizes range 
from 0-15%. 



(n = 10,000) Fast                Budget (σ = 0.5)    Nystrom (σ = 0.5)   eSPEC
Sample Size  Time      Error     Time      Error     Time      Error     Time      Error
0.5%         1.136     0.3298    96.986    0.3064    7.429     0.1676    0.8998    0.3252
1%           3.554     0.209     56.876    0.0048    7.575     0.0059    0.9284    0.2958
1.5%         4.538     0.1302    54.718    0.0027    7.759     0.0011    0.9466    0.2335
2%           6.21                57.23     0.0018    7.966     0.0009    0.9962    0.1579
3%           9.41                60.79     0.0014    8.595     0.001     1.1128    0.0691
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0.001 
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0.0039 
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0.001 

Table 7 The run time and error rate of each sample size for each approximation algorithm, run on the 
interlocked rings dataset. 



be a notion of "correct" clusters. To overcome this last challenge, we utilize historical data about teacher 
attrition which will allow us to properly identify the appropriate clustering. 

Each year, the National Center for Education Statistics sends out a follow-up survey to teachers 
originally selected for the Teacher Questionnaire in a Schools and Staffing Survey. A total of 4,528 teachers 
were given one of two surveys according to their employment status. Teachers were classified as stayers, 
movers, or leavers. Stayers are teachers who stayed at their current position, movers are teachers who 
continued teaching but transferred schools, and leavers are teachers who left the position entirely. Leavers 
took the former teacher questionnaire while stayers and movers took the current teacher questionnaire. 
The two surveys contained different sets of questions; the dataset used in our experiments is made up of 
the questions common to both surveys from the 1994-1995 school year. The attributes include household 
income (broken into intervals), marital status (coded as 0/1/2 for never married / married / separated), 
number of dependent children, age of youngest child, and dissatisfaction ratings. Because many teachers 
had the same responses in the six variables, a dummy variable was added so that the algorithm would 
recognize that the teachers are different people. The dummy variable was drawn uniformly between 
0 and 1. 
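
The tie-breaking step can be illustrated with a short Python sketch. It is not the code used in the study: the function name `add_dummy_variable`, the fixed seed, and the three example attribute vectors are hypothetical and serve only to show how a uniform draw separates teachers with identical survey answers.

```python
import random

def add_dummy_variable(rows, seed=0):
    """Append a value drawn uniformly from [0, 1) to every attribute vector."""
    rng = random.Random(seed)  # seeded only so that the sketch is reproducible
    return [list(row) + [rng.random()] for row in rows]

# Three hypothetical teachers; the first two gave identical answers
# on all six survey variables.
teachers = [
    [4, 1, 2, 3, 0, 1],
    [4, 1, 2, 3, 0, 1],
    [2, 0, 0, -1, 0, 2],
]
augmented = add_dummy_variable(teachers)

distinct_before = len({tuple(row) for row in teachers})
distinct_after = len({tuple(row) for row in augmented})
print(distinct_before, distinct_after)
```

Because the added coordinate lies in a unit interval, it changes pairwise distances by less than 1, so it mainly matters for points that would otherwise coincide.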

Spectral clustering of the entire dataset yielded a small cluster of 197 teachers who were never married 
and did not have children or other dependents. All of their household incomes ranged from $60,000 to 

Figure 12 The (a) average error rate and (b) CPU running time in seconds when using different sample sizes 
for all approximation algorithms. Interlocked rings dataset has n = 10,000 data points and sample sizes range 
from 0-15%. 



$74,999 and none of the teachers expressed dissatisfaction in the survey. The numbers of stayers, movers, 
and leavers are summarized in Table 8. 

Drawing conclusions about the likelihood of attrition for a group of teachers depends on the 
classification of movers. From a school's point of view, a mover is an attritor, but from the state's point of view, 
a mover is still a teacher. In Table 8, if movers are considered leavers, then the proportion of attritors in 
Cluster 1 is significantly different from the proportion of attritors in Cluster 2. The results yield a 
one-tailed p-value of 0.01, where teachers with the same characteristics as Cluster 1 are less likely to resign. 
However, if movers are considered stayers, then the proportion of attritors in Cluster 1 is not significantly 
different from the proportion of attritors in Cluster 2. In this case, the one-tailed p-value is 0.1950, where 
teachers with the same characteristics as Cluster 1 are more likely to resign. Because conclusions were 
affected by the classification of teachers who moved, these teachers were not included in most 
experiments. However, one can easily apply spectral clustering to cases where movers are 
considered either one or the other. 

While the age of a teacher's youngest child can provide valuable information about the teacher's age 
itself, setting different values for this variable for teachers without children gave varying results. For 
example, if the youngest child's age is set to 50, a large distance away from the maximum value of 38, 
spectral clustering tends to group teachers with older children together with childless teachers. If the 
youngest child's age is set to -1, a closer but impossible age, spectral clustering tends to group teachers 
with newborns together with childless teachers. In the following experiments, we used -1 as the age for 
teachers without children, though the technicalities of this variable lead us to question whether interaction 
terms ought to be considered when working with data of this nature. 
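
The effect of the placeholder can be seen by looking only at the youngest-child coordinate. The sketch below is illustrative: the helper `nearest_real_age` is hypothetical, and the assumption that real ages run over the integers 0 to 38 (0 for a newborn, 38 being the maximum quoted above) is made only for this example.

```python
def nearest_real_age(placeholder, ages):
    """Return the real youngest-child age closest to the placeholder value."""
    return min(ages, key=lambda age: abs(age - placeholder))

# Assumed range of real youngest-child ages: 0 (newborn) through 38.
ages = list(range(0, 39))

for placeholder in (50, -1):
    closest = nearest_real_age(placeholder, ages)
    gap = abs(placeholder - closest)
    print(placeholder, closest, gap)
```

Along this single coordinate a placeholder of 50 sits 12 units from the oldest children, while -1 sits 1 unit from newborns, which matches the direction of the groupings described above.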

Spectral clustering was applied to subgroups of teachers, such as teachers without children or 
unmarried teachers. Most experiments yielded clusters with a mix of teachers who stayed and teachers who left. 
If teachers who moved were included, they would generally be mixed in both clusters as well. An example 
of this can be seen in Table 8, where we applied spectral clustering to teachers with children; movers were 
removed. Although both clusters contained a mix of stayers and leavers, in a two-proportion Z-test, we 
obtain a one-tailed p-value of about 0.0023, providing evidence that Cluster 2 has a greater proportion 
of teachers who will quit. We conclude that teachers with characteristics similar to those in Cluster 2 are 
more likely to quit. 
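
The two-proportion Z-tests quoted in this section can be approximated with the following Python sketch. It assumes the pooled-variance form of the test without continuity correction; the text does not state which variant was used, so this is an assumption, as is the function name `two_proportion_z_test`. The counts are those listed in Table 8.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test. Returns (z, one-tailed p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # One-tailed p-value in the direction of the observed difference,
    # using the standard normal tail: P(Z > |z|) = erfc(|z| / sqrt(2)) / 2.
    p_one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return z, p_one_tailed

# Leavers among teachers with movers removed: 48 of 148 versus 699 of 1,569.
z_a, p_a = two_proportion_z_test(48, 148, 699, 1569)
# Entire dataset, movers counted as leavers.
z_b, p_b = two_proportion_z_test(24 + 81, 197, 1016 + 1649, 4331)
# Entire dataset, movers counted as stayers.
z_c, p_c = two_proportion_z_test(81, 197, 1649, 4331)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```

Under these assumptions the three p-values come out close to the values of about 0.0023, 0.01, and 0.1950 reported in the text, and the sign of z indicates which cluster has the larger proportion of leavers.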

One of the more interesting results that spectral clustering produced on the teacher data was the case 
where one of two clusters contained only teachers who left. In this run, movers were removed and only 
teachers without children were considered. Spectral clustering grouped 173 teachers together, all of whom 




Cung, Jin, Ramirez, and Thompson 



Approximation Algorithms for Spectral Clustering 



Status   Cluster 1  Cluster 2     Status   Cluster 1  Cluster 2     Status   Cluster 1  Cluster 2
Stayer   92         1,666         Stayer   100        870           Stayer   0          788
Mover    24         1,016         Leaver   48         699           Leaver   173        810
Leaver   81         1,649         Total    148        1,569         Total    173        1,598
Total    197        4,331

Table 8 Resulting clusters of the entire teacher dataset (left), with movers eliminated (center) and with 
teachers with children and movers eliminated (right). 



had quit. They were all married and had household incomes in the $60,000 to $74,999 range. The 
teachers in this group expressed at most one dissatisfaction in the survey. In Table 8, this group is labeled 
Cluster 1. Cluster 2 consists of all other teachers who were not movers and did not have children. Among 
the 1,598 teachers in Cluster 2, 810 quit. In a two-proportion Z-test, this gave us a one-tailed p-value of 
less than 0.0001, which supports the idea that teachers similar to those in Cluster 1 are more likely to quit 
than those similar to teachers in Cluster 2. 

To ensure that spectral clustering worked and would give us accurate clusters, the clustering algorithm 
was applied to a 1/3 sample of a subgroup of teachers. Results were used to try to predict the 
remaining 2/3. For example, in the case of the 488 unmarried teachers, 163 teachers were sampled (Table 9). 
Spectral clustering grouped 23 of the teachers in one group because none of them expressed complaints 
or had children or other dependents. Among this first cluster, 17 had quit, while 38 of the 140 
teachers in the second cluster quit. This yielded a one-tailed p-value of less than 0.0001, indicating that teachers 
in the first cluster are more likely to quit. Going through the 325 unsampled teachers, 55 displayed the 
same characteristics as the first cluster. Proportionally, our prediction that 41 of the 55 teachers would 
quit was reasonably close, considering that in actuality 38 teachers quit. Our prediction for the second group of 
teachers was further off, but still supported the finding that teachers similar to those in the first cluster 
are more likely to quit than teachers similar to those in the second cluster. Spectral clustering was able 
to group the 55 teachers together in its run with the remaining 2/3 of the data points. 
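
The predicted counts follow from scaling the leaver rate observed in each sampled cluster to the size of the matching held-out group. A minimal Python sketch of that arithmetic, with a hypothetical helper name `predicted_leavers`, is:

```python
def predicted_leavers(sample_leavers, sample_size, holdout_size):
    """Scale the leaver rate seen in a sampled cluster to a held-out group."""
    return round(sample_leavers / sample_size * holdout_size)

# First cluster: 17 of the 23 sampled teachers quit; 55 held-out teachers match it.
first = predicted_leavers(17, 23, 55)
# Second cluster: 38 of the 140 sampled teachers quit; 270 held-out teachers remain.
second = predicted_leavers(38, 140, 270)

print(first, second)  # 41 73
```

These are the predicted values listed in Table 9, to be compared with the 38 and 94 leavers actually observed in the held-out groups.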





             With Children Cluster   Without Children Cluster
SC on 1/3    17/23                   38/140
Predicted    41/55                   73/270
SC on 2/3    38/55                   94/270

Table 9 Prediction given by spectral clustering for the teacher dataset (married teachers and movers 
eliminated). 



Although [13] recommended using the 7th nearest neighbor to help determine the similarity 
bandwidth of each point, using different values of nearest neighbor for this dataset yielded different clusters 
that provided valuable information. We applied spectral clustering to married teachers who were not 
movers using the 7th, 50th, and 100th nearest neighbor. All teachers in the smaller cluster (Cluster A) 
using the 7th nearest neighbor were found in the same cluster (with other teachers) when using the 100th 
nearest neighbor. That same cluster, with the newly added teachers, in the 100th nearest neighbor was 
also found in the same cluster with most of the remaining teachers when using the 50th nearest neighbor. 
Define Cluster B as teachers grouped with teachers in Cluster A using the 100th nearest neighbor. Define 
Cluster C as teachers grouped with teachers in Cluster A and B using the 50th nearest neighbor. The 
remaining teachers make up Cluster D. A summary of the characteristics of teachers in each cluster can be 
found in Table 10. Note that the table does not reflect correlations; for example, a dissatisfaction value 
of 1 only appears in Cluster B if the teacher does not have a child. 
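
For reference, the local scaling rule of [13] sets the bandwidth of each point to the distance to its k-th nearest neighbor and uses the affinity exp(-d(i,j)^2 / (sigma_i sigma_j)). The Python sketch below illustrates that rule on toy data; it is not the implementation used in the study, and the function name, the toy points, and the choice to leave the affinity at zero when a bandwidth is zero are assumptions made for the example.

```python
import math

def local_scaling_affinity(points, k=7):
    """Affinity A[i][j] = exp(-d(i,j)^2 / (sigma_i * sigma_j)), zero diagonal."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # sigma_i is the distance from point i to its k-th nearest neighbor.
    # Index 0 of each sorted row is the point itself (distance 0).
    sigma = [sorted(row)[min(k, n - 1)] for row in dist]
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Guard against zero bandwidths, which occur when a point has
            # at least k exact duplicates.
            if i != j and sigma[i] > 0 and sigma[j] > 0:
                affinity[i][j] = math.exp(-dist[i][j] ** 2 / (sigma[i] * sigma[j]))
    return affinity

# Two well-separated groups on a line; k=2 because the toy set is tiny.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
A = local_scaling_affinity(points, k=2)
print(A[0][1] > A[0][3])  # within-group affinity exceeds cross-group affinity
```

A larger k widens every bandwidth, so more distant points receive non-negligible affinity, which is one way to read the progressively larger clusters obtained with the 50th and 100th nearest neighbor.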

We found that the difference between all four clusters is statistically significant with a p-value of less 
than 2.2E-16. With this, we can obtain a ranking of the teachers where teachers with 2 or 3 points of 
dissatisfaction are most likely to resign. Teachers with a young only child and teachers without 
children, excluding those with the exact characteristics of Cluster A, are very likely to resign. Teachers with 
the exact characteristics of Cluster A could possibly resign, while all other teachers with at most one 
dissatisfaction point are not likely to resign. In summary, a ranking of teachers who are most likely to 
quit teaching is achievable with spectral clustering. This motivates considering more than two clusters in 
spectral clustering. 



Characteristics    Cluster A        Cluster B   Cluster C   Cluster D
Household Income   60,000-74,999    Varies      Varies      Varies
Children           None             0 to 1      Varies      Varies
Youngest Child     -1               -1 to 4     Varies      Varies
Other Dependents   None             None        0 to 1      0 to 3
Dissatisfaction    None             0 to 1      0 to 1      2 to 3
Stayer             92               398         772         0
Leaver             81               567         480         214
Total              173              965         1,252       214

Table 10 Four clusters given by altering the tuning parameter on the teacher dataset (unmarried teachers and 
movers eliminated). 



Our analysis of spectral clustering on the teacher data was facilitated by looking at subgroups of 
teachers. It is possible that the variables that define the subgroups (i.e. unmarried teachers 
only, teachers without children only, etc.) interact with other variables, such as age of youngest child. 
Variable transformations, interaction terms, multiple clusters, and weighting are thus important points 
of consideration when working with spectral clustering on attrition-like data. 

4.1. Approximation Results. To measure the effectiveness of the approximation algorithms on the teacher 
dataset, we ran each one 10 times at each sample size and compared the average run time and 
error rate. The "movers" category was removed for simplicity. As the ground truth, we use the answers 
obtained by the exact spectral clustering algorithm. In other words, we measure the ability of the 
approximation algorithms to give the same answer as exact spectral clustering in a shorter amount of time 
(the exact algorithm ran in 476.47 seconds). 
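
One way to compute such an error rate is sketched below in Python. It assumes that the error is the fraction of points whose cluster differs from the reference clustering after the best relabeling of clusters, averaged over repeated runs; this definition, the function names, and the toy labels are assumptions for illustration rather than the exact procedure of the study.

```python
from itertools import permutations

def clustering_error(reference, predicted):
    """Fraction of points labeled differently, minimized over relabelings."""
    labels = sorted(set(reference) | set(predicted))
    best = len(reference)
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        mismatches = sum(1 for r, p in zip(reference, predicted) if r != mapping[p])
        best = min(best, mismatches)
    return best / len(reference)

def average_error(reference, runs):
    """Mean error of several approximate runs against one reference clustering."""
    return sum(clustering_error(reference, run) for run in runs) / len(runs)

# Toy reference labels standing in for the exact spectral clustering output.
exact = [0, 0, 0, 1, 1, 1, 1, 1]
runs = [
    [1, 1, 1, 0, 0, 0, 0, 0],  # same partition with the label names swapped
    [0, 0, 1, 1, 1, 1, 1, 1],  # one point assigned to the other cluster
]
print(average_error(exact, runs))  # 0.0625
```

Trying every permutation is only practical for a handful of clusters, which is sufficient for the two-cluster setting considered here.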

Although the results are similar to those obtained using visually apparent clusters, we see two major 
differences. First, as seen in Table 11, spectral clustering on a budget was not necessarily the slowest 
or most inaccurate algorithm of the group. Second, as seen in Figure 14, it took longer for the 
algorithms to reach zero error. This is potentially due to the proximity of the clusters. Perhaps 
spectral clustering on a budget handles less structured clusters better than the others. Still, it displays 
an odd progression in terms of run time, one that is not entirely upward sloping, as seen in Figure 14. 
This leads us to believe it may be unstable in this setting. For this kind of dataset, fast spectral clustering 
or eSPEC may offer more advantages. Alternatively, if an error rate of roughly 5% is acceptable, 
Nystrom gives adequate results the quickest. A depiction of the tradeoff between efficiency and accuracy 
for this dataset is given in Figure 13. 

5. Discussion 

Fast spectral clustering frequently gives the most accurate results in the shortest running time for 
small datasets using a small k. For easily clustered data, this may be due to the k-means algorithm 
dominating the fast spectral clustering algorithm for very small k values. The Nystrom method often 
performs quickly and accurately as well, especially on the larger or more complicated datasets. eSPEC is 
the fastest when n is extremely large, but also the least accurate. 

Figure 13 Error rate versus time (in seconds) plot for the teacher data (without movers). 




Figure 14 Sample sizes range from 5-100% of the teacher dataset with n = 3,488, and without the classification 
of movers. 



(n = 3,488)  Fast                Budget              Nystrom             eSPEC
Sample Size  Time      Error     Time      Error     Time      Error     Time      Error
5%           1.9906    0.2518    102.9295  0.0499    1.7644    0.066     0.4243    0.1889
10%          3.4086    0.1691    136.6257  0.0499    4.1699    0.0551    0.8143    0.1721
20%          10.3694   0.1735    204.2521  0.0502    14.2835   0.0601    5.4663    0.1709
30%          22.8323   0.1541    276.855   0.0493    37.1969   0.0517    16.3941   0.1618
40%          41.5103   0.1079    208.5421            80.0753   0.0567    34.2765   0.1328
50%          69.0788   0.0812    194.2056            146.9904  0.0536    61.5502   0.0843
60%          106.2819  0.065     222.941             241.4677  0.0506    102.1619  0.052
70%          164.4188  0.07      223.8302            371.103   0.0511    162.096   0.0752
80%          236.8828  0.045     239.8047            532.3269  0.0533    232.819   0.035
90%          329.4538  0.01      277.4946            733.4184  0.0511    326.2121  0.015

Table 11 The run time and error rate of each sample size for each approximation algorithm, run on the teacher 
dataset. 



Intuitively, we sacrifice accuracy for efficiency when we run the approximation algorithms on a 
relatively small set of points compared to the dataset size. However, the trend is not so apparent for 
spectral clustering on a budget. The other algorithms face approximately the expected tradeoff. Across all 
datasets, spectral clustering on a budget often takes longer than the other three with limited improvement 
in accuracy. Interestingly, it consistently reaches a point where smaller sample sizes actually increase 
its running time. Thus, it may be preferable to utilize one of the other three algorithms, depending 
on the size of the data and the goal of the clustering results. 

In addition, when given the right data, spectral clustering can find similarities in individuals that may 
point to employees at high risk of attrition. Using a subset of the teacher dataset, we found we could 
predict with some accuracy which teachers had left. This shows that if possible, breaking the data down 
and running spectral clustering on smaller groups is very useful. Approximation methods did not per- 
form quite as well on the teacher dataset, but they do give accurate results and cut down run time by a 
few minutes. If run on a larger employee dataset, they would likely increase efficiency by a greater factor. 
With finer tuning of parameters and variable choices, even more improvements may be possible. 

Acknowledgements 

We would like to thank our advisors Christos Boutsidis and Deanna Needell. We also thank Mike 
Raugh, Stacey Beggs, Dimi Mavalski, and all the faculty at the Institute of Pure and Applied Mathematics 
for directing and coordinating this summer. Lastly, thank you to IBM and NSF for funding this project. 

References 

[1] C.T.H. Baker. The numerical treatment of integral equations, volume 13. Clarendon Press, Oxford, 1977. 
[2] J.C. Bezdek, R.J. Hathaway, J.M. Huband, C. Leckie, and R. Kotagiri. Approximate clustering in very large relational data. 
International Journal of Intelligent Systems, 21(8):817-841, 2006. 
[3] M. Cuturi. Positive definite kernels in machine learning. arXiv preprint arXiv:0911.5367, 2009. 
[4] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank 
approximation to a matrix. SIAM Journal on Computing, 36(1):158-183, 2006. 
[5] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nystrom method. Pattern Analysis and Machine 
Intelligence, IEEE Transactions on, 26(2):214-225, 2004. 
[6] L. Huang, D. Yan, M.I. Jordan, and N. Taft. Spectral clustering with perturbed data. Advances in Neural Information 
Processing Systems (NIPS), pages 705-712, 2008. 
[7] B. Hunter and T. Strohmer. Performance analysis of spectral clustering on compressed, incomplete and inaccurate 
measurements. Submitted, 2011. 
[8] S. Lloyd. Least squares quantization in PCM. Information Theory, IEEE Transactions on, 28(2):129-137, 1982. 
[9] D.J.C. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003. 
[10] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. WALCOM: Algorithms and 
Computation, pages 274-285, 2009. 
[11] E.J. Nyström. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta 
Mathematica, 54(1):185-204, 1930. 
[12] M. Pavan and M. Pelillo. Efficient out-of-sample extension of dominant-set clusters. Advances in Neural Information 
Processing Systems, 17:1057-1064, 2005. 
[13] P. Perona and L. Zelnik-Manor. Self-tuning spectral clustering. Advances in Neural Information Processing Systems, 
17:1601-1608, 2004. 
[14] O. Shamir and N. Tishby. Spectral clustering on a budget. AISTATS, 2011. 
[15] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions 
on, 22(8):888-905, 2000. 
[16] G.W. Stewart. Introduction to matrix computations, volume 441. Academic Press, New York, 1973. 
[17] A. Talwalkar, S. Kumar, and H. Rowley. Large-scale manifold learning. In Computer Vision and Pattern Recognition, 2008. 
CVPR 2008. IEEE Conference on, pages 1-8. IEEE, 2008. 
[18] L. Wang, C. Leckie, K. Ramamohanarao, and J. Bezdek. Approximate spectral clustering. Advances in Knowledge Discovery 
and Data Mining, pages 134-146, 2009. 
[19] X. Wu and V. Kumar. The Top Ten Algorithms in Data Mining. Chapman & Hall/CRC Data Mining and Knowledge Discovery 
Series. Taylor & Francis, 2009. 
[20] D. Yan, L. Huang, and M.I. Jordan. Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD 
international conference on Knowledge discovery and data mining, pages 907-916. ACM, 2009. 






